1 Introduction


As technology continues to advance, machine learning plays a growing role in shaping
the future. One widely used machine learning technique is clustering (also known as
cluster analysis). Clustering is the process of discovering patterns in unlabeled data
(data that has not been tagged with identification [1]); it aims to group individual
objects based on their degree of similarity to one another [2]. Clustering can be applied
to many real-world problems, from grouping customers by their behavioral psychology to
grouping different types of wine (a data set that will be explored in this investigation),
and its results are relevant across many disciplines.


Clustering algorithms have many configurable parameters that affect their overall
performance. To maximize the effectiveness of clustering for a given unlabeled data set,
it is vital to understand how the performance of each algorithm changes when these
parameters are adjusted.


This paper evaluates how three configurable properties of k-means clustering (initial
placement, number of clusters, and number of features) affect its performance, measured
through silhouette score and the number of iterations. K-means clustering serves to
extract value from large unlabeled data sets, so the results of this research have the
potential to improve the efficiency of clustering applications. By understanding how
these parameters can be configured to maximize effectiveness, industries such as business
could benefit from better client, product, and data clustering in their operations.

The following research question will be explored: To what extent is the performance
of the k-means clustering algorithm in unsupervised learning influenced by the initial
placement algorithm, the number of features, and the number of clusters? For this
investigation, k-means clustering algorithms were programmed to group data from a
synthetic data set and a public wine data set. For each data set, the initial placement,
number of clusters, and number of features were altered at each rerun. Patterns were
analyzed and performance was evaluated through calculated silhouette scores and the
number of iterations needed to complete the process. This investigation will also
determine whether the metric used, silhouette score, is a reliable indicator of the
accuracy of an unsupervised learning algorithm. Logical and mathematical explanations
for the results obtained are discussed.


2 Background Information 


2.1 Supervised and Unsupervised Learning 


Machine learning algorithms follow two main approaches: supervised and unsupervised
learning. Supervised learning refers to working with labeled data sets to train
and "supervise" algorithms in processing data. Since input and output data are labeled,
the accuracy of a supervised learning model can be measured directly. Classification and
regression algorithms are the most common types trained by supervised learning, because
they rely on a labeled data set [3].


Unsupervised learning discovers hidden patterns without the need for human intervention
or labeled data sets. The main tasks associated with unsupervised learning are
clustering, association, and dimensionality reduction.


2.2 K-Means Clustering 


This paper will specifically explore the k-means clustering algorithm, a method of vector
quantization that originally stems from signal processing.


Given a set of observations $(x_1, x_2, \ldots, x_n)$, where each observation is a
d-dimensional real vector, k-means clustering aims to partition the n observations into
$k$ ($\le n$) sets $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the within-cluster
sum of squares (i.e., variance).

Since finding the optimal k-means partition is computationally difficult [4], a heuristic
iterative refinement technique was introduced, in which k is defined prior to the
clustering process. In this investigation, the number of clusters k will be one of the
three configurable parameters.
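For reference, the objective described above is commonly written as follows. The symbol $\mu_i$, the mean of the points in $S_i$, is introduced here for illustration and does not appear in the original text.

```latex
\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```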


The approach k-means follows to solve the problem is a form of Expectation-Maximization.
The E-step (expectation) assigns each data point to the closest cluster centroid. The
M-step (maximization) recomputes the centroid of each cluster. Here is a rundown of how
k-means operates:

1. Specify the number of clusters k.

2. Initialize the centroids by first shuffling the data set and then randomly selecting k
data points as centroids, without replacement.

3. Keep iterating the E-step and M-step until a stopping criterion is met. As k-means is
an iterative process, it is crucial to understand when to stop the algorithm. The three
stopping criteria are: the centroids of newly formed clusters do not change, the points
remain in the same clusters, or the maximum number of iterations is reached [5].
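The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the scikit-learn implementation used in the experiments; the function name `kmeans_sketch`, its defaults, and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=300, seed=0):
    """Illustrative Lloyd-style k-means; name and defaults are hypothetical."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for n_iter in range(1, max_iter + 1):
        # E-step: assign every point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[new_labels == j].mean(axis=0) if np.any(new_labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when assignments or centroids no longer change.
        converged = (np.array_equal(new_labels, labels)
                     or np.allclose(new_centroids, centroids))
        labels, centroids = new_labels, new_centroids
        if converged:
            break
    return labels, centroids, n_iter

# Toy data (an assumption): two well-separated groups of 50 points each.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
labels, centroids, n_iter = kmeans_sketch(X_demo, k=2)
print(n_iter, centroids.round(2))
```

The loop also stops at `max_iter`, which corresponds to the third stopping criterion in the list.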
Figure 1: Example of unlabelled data going through the k-means Expectation-Maximization
process [6]


2.2.1 Parameters 


In this paper, the effects of three different parameters on the performance of k-means 
processes will be investigated across two data sets - a synthetic data set and a real data 


set containing the chemical properties of certain wine types. 


2.2.2 First configurable parameter: initial placement 


The first altered parameter will be the initial placement, which will be set to either
random or k-means++. This refers to the initial placement of the cluster centroids in the
k-means clustering process. A random initial placement means that the center points of
clusters are chosen at random from the data. K-means++ is a biased random sampling that
favors centers farther apart from one another, avoiding close points; it aims to reach
good clustering results in fewer iterations. The first centroid in k-means++ is chosen
at random, and each subsequent centroid is sampled with probability proportional to its
squared distance from the nearest centroid already chosen, so the most distant datapoints
are the most likely to be selected.


(a) Distance from each centroid (b) Blue datapoint is chosen as 
third centroid 


Figure 2: Determining third centroid [5] 


The above figure demonstrates the use of k-means++ to determine the third centroid of
a set of datapoints. The squared distance of each datapoint from its closest centroid
(green or red) is calculated, and the blue datapoint is selected as the third centroid
since it has the largest squared distance from its nearest centroid in Figure 2a.
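A minimal sketch of k-means++ seeding follows, assuming the standard formulation in which each new centroid is sampled with probability proportional to its squared distance from the nearest existing centroid, so the farthest point (as in Figure 2) is the most likely pick. The helper name `kmeans_pp_init` and the toy data are hypothetical; scikit-learn's own implementation differs in detail.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Sketch of k-means++ seeding (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # First centroid: chosen uniformly at random from the data.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

# Toy data (an assumption): three compact groups of 30 points each.
rng = np.random.default_rng(2)
X_demo = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0.0, 5.0, 10.0)])
centers = kmeans_pp_init(X_demo, k=3)
print(centers.round(2))
```

Because an already chosen point has squared distance zero, it cannot be drawn again, so the k seeds are always distinct data points.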


2.2.3 Second configurable parameter: number of clusters 


The second configurable parameter in this investigation is the number of clusters. In
principle, k can be any integer up to the number of observations, so many different
cluster counts are possible. This investigation determines the optimum combination of
the three configured parameters for the synthetic and wine data sets.


2.2.4 Third configurable parameter: number of features 


The final parameter to be configured in this investigation is the number of features. The
number of features can vary greatly for real-world data sets. In the context of students
at a school, features include nationality, gender, grades, household income, etc. This
is an interesting area of exploration, since on the surface it may seem that more
features make it easier to find similarities and establish clusters. However, it could
also be harder, because the conditions of similarity become much more nuanced.


2.2.5 Feature Scaling 


Feature scaling is an important step to take prior to processing data for many machine
learning algorithms. It is implemented here through standardization, which re-scales
each feature to have zero mean and unit variance, the properties of a standard normal
distribution. This is vital in many algorithms, as they may behave badly if individual
features do not resemble standard normally distributed data. For example, if an
investigation aims to describe the physical attributes of people and the data provided
includes their heights in centimeters and weights in pounds, a five-pound difference
cannot directly be compared to a five-centimeter difference in height.


Features are standardized by removing the mean and scaling to unit variance. As an
example, the standard score of a sample x is calculated as $z = (x - u)/s$, where u is
the mean of the training samples and s is the standard deviation of the training samples.
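As a small illustration of the standard score, the sketch below applies z = (x - u)/s by hand and compares the result with scikit-learn's `StandardScaler`. The height and weight values are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical heights (cm) and weights (lb) for four people.
data = np.array([[170.0, 150.0],
                 [180.0, 165.0],
                 [160.0, 120.0],
                 [175.0, 185.0]])

# Manual standard score, computed per feature (column): z = (x - u) / s.
z_manual = (data - data.mean(axis=0)) / data.std(axis=0)

# StandardScaler applies the same transformation.
z_sklearn = StandardScaler().fit_transform(data)

print(np.allclose(z_manual, z_sklearn))
```

After scaling, both columns have zero mean and unit variance, so a difference of one unit means "one standard deviation" for height and weight alike.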


2.2.6 Dimensionality Reduction Through Principal Component Analysis (PCA)


One setting in which feature scaling is used is Principal Component Analysis, or PCA. PCA
is a dimensionality reduction method typically used to reduce the feature dimensionality
of large data sets. This is done by transforming a data set with many variables into a
smaller one with fewer variables that still captures most of the information in the
original data set. Simply put, the goal of dimensionality reduction methods such
as PCA is to decrease the number of variables of a data set while preserving as much
information as possible [7].


To better understand PCA, refer to Figure 3 below. There are 10 principal components
shown, meaning the original data set is 10-dimensional, having 10 features (variables).
Principal components are constructed as linear combinations of all ten of the
variables. They are mixed in such a way that most of the information in the variables is
compressed into the first few principal components (as represented by the highest
percentage of explained variance being in principal component 1).


[Bar chart: percentage of explained variances (y-axis) for principal components 1 to 10
(x-axis)]

Figure 3: Principal component analysis [7]


One obvious issue with lowering the number of variables in a data set is that accuracy
can be negatively affected. However, the intent of dimensionality reduction methods
is to sacrifice a little accuracy for simplicity. Smaller data sets without
extraneous variables are easier to investigate, which makes visualization and analysis
with machine learning algorithms easier, faster, and more streamlined.
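The share of information carried by each principal component can be inspected directly. The sketch below is an illustration on the scikit-learn wine data (13 features) rather than the 10-feature data set behind Figure 3; the variable names are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_wine = load_wine().data                        # 178 samples, 13 features
X_scaled = StandardScaler().fit_transform(X_wine)

pca = PCA()                                      # keep all 13 components
pca.fit(X_scaled)

# Fraction of the total variance explained by each component, largest first.
ratios = pca.explained_variance_ratio_
print(ratios.round(3))
print(ratios.cumsum().round(3))                  # cumulative share of variance
```

Reading the cumulative sum shows how many components are needed to retain a chosen share of the variance, which is the trade-off between accuracy and simplicity described above.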
Figure 4: Visualization of feature scaling through principal component analysis [8]


As mentioned in the previous subsection, feature scaling standardizes the features of
a data set. To execute PCA effectively, feature scaling is required so that variables
with different units and ranges can be compared on equal terms. In Figure 4 above, there
are three classes, and the goal is to separate them. If successful, there should be a
clear distinction between the three classes (represented by three different colors and
shapes). An example of PCA without feature scaling is shown on the left, with two
principal components plotted on the x and y axes. Entities that belong to different
classes are mixed together, demonstrating a failed attempt at separating them. After
feature scaling, as shown on the right, PCA proves to be much more effective: there
is a clear distinction between entities of each of the three classes. The
feature-scaled version on the right greatly outperforms the non-scaled version on the
left, emphasizing the importance of feature scaling in dimensionality reduction through
PCA.


3 Methodology 


Primary experimental data is the main source of data in this paper. Two data sets (a
synthetic data set and a wine data set) were used to run the k-means clustering process
(code in appendix, adapted from an example from scikit-learn [9]). The number of
iterations taken to run each configuration was recorded, and accuracy was measured
by silhouette score. This investigation took an experimental approach because there was
limited secondary data to answer the research question. The chosen approach allows
independent variables to be easily manipulated. Since an experimental approach was
taken, the results of the experiment are limited to the scope of the procedure.


The hardware used was an Apple MacBook Air (M1, 2020) with 16 GB of memory. The software
used was Python 3.9.0 and scikit-learn 1.1.1.


3.1 Data Sets Used 
3.1.1 Synthetic data set 


The synthetic data set used in this investigation is generated with the make_blobs
function from scikit-learn. This particular setting has one distinct cluster and 3
clusters placed close together. Below is the source code that generates the synthetic
data set.
X, y = make_blobs(
    n_samples=1000,
    n_features=20,
    centers=4,
    cluster_std=1,
    center_box=(-10.0, 10.0),
    shuffle=True,
    random_state=1,
)


3.1.2 Wine data set 


The wine data set includes 3 classes containing 59, 71, and 48 samples, respectively.
For each row, there are 13 real and positive features: Alcohol, Malic acid, Ash,
Alkalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols,
Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline [10].


   alcohol  malic_acid   ash  ...   hue  od280/od315_of_diluted_wines  proline
0    14.23        1.71  2.43  ...  1.04                          3.92   1065.0
1    13.20        1.78  2.14  ...  1.05                          3.40   1050.0
2    13.16        2.36  2.67  ...  1.03                          3.17   1185.0
3    14.37        1.95  2.50  ...  0.86                          3.45   1480.0
4    13.24        2.59  2.87  ...  1.04                          2.93    735.0


Figure 5: First five rows of wine data set


3.2 Evaluation metrics 
3.2.1 Silhouette score 


The metric of accuracy used in this paper is the silhouette score, mainly used to
evaluate the quality of the clusters created. The silhouette score is calculated for each
data point and requires two quantities: the mean distance between the point and all other
data points in the same cluster, known as the mean intra-cluster distance, and the mean
distance between the point and all data points in the nearest neighboring cluster, known
as the mean nearest-cluster distance. In the following equation, the mean intra-cluster
distance is denoted by a, the mean nearest-cluster distance is denoted by b, and the
silhouette score is denoted by S.

$$S = \frac{b - a}{\max(a, b)}$$


The silhouette score ranges between -1 and 1. A score near 1 means that the cluster is
dense and well separated from other clusters. A value of 0 represents overlapping
clusters, with samples extremely close to the boundary of neighboring clusters.
A negative score indicates inaccuracy, suggesting that the datapoints may have been
assigned to the wrong cluster [11].
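To make the metric concrete, the sketch below computes the silhouette score of one point by hand, assuming the standard definition S = (b - a) / max(a, b), and checks it against scikit-learn's `silhouette_samples` and `silhouette_score`. The four toy points and their labels are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy data: two tight, well-separated groups of two points each.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels_toy = np.array([0, 0, 1, 1])

# Manual score for the first point.
# a: mean distance to the other points in its own cluster (one neighbor here).
a = np.linalg.norm(X_toy[0] - X_toy[1])
# b: mean distance to the points of the nearest other cluster.
b = np.mean([np.linalg.norm(X_toy[0] - X_toy[2]),
             np.linalg.norm(X_toy[0] - X_toy[3])])
s_manual = (b - a) / max(a, b)

s_each = silhouette_samples(X_toy, labels_toy)   # one score per data point
s_mean = silhouette_score(X_toy, labels_toy)     # average over all points
print(s_manual, s_each[0], s_mean)
```

The value reported for a whole clustering, and used throughout the results, is the mean of the per-point scores.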


Figure 6: Example of silhouette analysis for 2, 3, 4, and 5 clusters [11]


Above is a visualization of silhouette analysis done on 2, 3, 4, and 5 clusters. The
average silhouette score is indicated by the vertical dotted line. Using 4 or 5 clusters
is sub-optimal for the given data set due to the presence of clusters with below-average
silhouette scores and wide fluctuations in the size of the silhouette plots.


The silhouette scores for 2 and 3 clusters look relatively optimal. The score for
each cluster is above the average silhouette score and there is minimal fluctuation in
size.


Silhouette plots in which the clusters have uniform thickness are an indication of the
optimal number of clusters. The top right plot with 3 clusters has the most uniform
thickness out of all four plots. Thus, the optimal number of clusters in the above
figure is 3.


4 Experimental results 


4.1 Table of Synthetic data set Results 


The following table displays the experimental results of k-means clustering using the 
synthetic data set. Silhouette scores are displayed to four significant figures in order to 


maintain high accuracy, as differences between some values were rather minimal. 


Initial placement | Silhouette score (5, 10, 15, 20 features) | Number of iterations
Random | 0.6182 | 0.6381 | 0.599 | 0.6378 | 14 | 12 | 10
Random | 0.4610 | 0.4630 | 0.4245 | 0.4397 | 14 | 13 | 20


Figure 7: Synthetic data set experiment results
4.2 Table of Wine data set Results


The following table displays the experimental results of k-means clustering using the 


wine data set. 


Initial placement | Silhouette score, by number of features
Random    | 0.3145 | 0.2373 | 0.2655 | 0.2837
K-means++ | 0.3145 | 0.2373 | 0.2655 | 0.2837
Random    | 0.3590 | 0.2585 | 0.2339 | 0.2594
K-means++ | 0.3590 | 0.2585 | 0.2333 | 0.2596
Random    | 0.3508 | 0.2413 | 0.2243 | 0.2071
K-means++ | 0.3508 | 0.2493 | 0.2289 | 0.2217
Random    | 0.3121 | 0.2321 | 0.1744 | 0.1834
K-means++ | 0.3256 | 0.2293 | 0.2008 | 0.1827
Random    | 0.3149 | 0.2445 | 0.1750 | 0.1540
K-means++ | 0.3168 | 0.2338 | 0.1773 | 0.1983


Figure 8: Wine data set experiment results


4.3 Example of Programmed Outcome 


To visualize some results of the code, the following two figures show charts produced
by the program. Figure 9 depicts the synthetic data set results with k-means++
initial placement, 5 features, and 4 clusters. The results depicted are optimal, since all
four clusters in the left chart are almost of equal size and all cross the average
silhouette score threshold indicated by the dotted red line. The right chart displays
the actual datapoints grouped into clusters, shown in the four colors that correspond
to the left chart. Note that the x and y axes of the right chart represent only two of
the five features, as it is not easy to create a five-featured visual representation of
the clusters. However, by comparing two features, the reader can still clearly see the
distinction between clusters. Figure 10 is not optimal, with no consistently sized
clusters and only two clusters crossing the average silhouette score threshold.
[Silhouette analysis for KMeans clustering on sample data with n_clusters = 4. Left
panel: the silhouette plot for the various clusters (x-axis: the silhouette coefficient
values; y-axis: cluster label). Right panel: the visualization of the clustered data
(x-axis: feature space for the 1st feature; y-axis: feature space for the 2nd feature).]

Figure 9: Synthetic data set results of k-means++ initialization, 5 features, and 4
clusters (figure generated by author)


[Silhouette analysis for KMeans clustering on sample data with n_clusters = 6. Same two
panels and axes as Figure 9.]

Figure 10: Synthetic data set results of k-means++ initialization, 5 features, and 6
clusters (figure generated by author)


4.4 Graphical Presentation of Achieved Results 


For ease of visualization, the data is displayed in the bar charts below. In all bar
charts, the left bar (blue) shows random initial placement, and the right bar (orange or
grey) shows k-means++ initial placement.


The first row of blue-orange charts displays the silhouette scores of the data sets with
5 features (a) and 15 features (b). Each bar represents the effect of a particular number
of clusters (x-axis) on the accuracy (indicated by silhouette score on the y-axis) of the
k-means clustering process. The second row is the same, except that the y-axis shows
the number of iterations. The final row (of blue-grey charts) displays the silhouette
scores (a) and number of iterations (b) of a data set with a fixed number of clusters
when altering the number of features (x-axis). These row descriptions apply to
the 14 charts in Sections 4.4.1 and 4.4.2. All figures are generated by the author.
4.4.1 Graphical presentation of synthetic data set results


(a) Synthetic data set with 5 features (b) Synthetic data set with 15 features

Figure 11: Silhouette score of synthetic data set


(a) Synthetic data set with 5 features (b) Synthetic data set with 15 features

Figure 12: Number of iterations of synthetic data set


(a) Silhouette score of synthetic data set with 4 clusters (b) Number of iterations of
synthetic data set with 4 clusters

Figure 13: Altering the number of features of synthetic data set with 4 clusters


4.4.2 Graphical presentation of wine data set results


(a) Wine data set with 5 features (b) Wine data set with 15 features

Figure 14: Silhouette score of wine data set


(a) Wine data set with 5 features (b) Wine data set with 15 features

Figure 15: Number of iterations of wine data set


(a) Silhouette score of wine data set with 3 clusters (b) Number of iterations of wine
data set with 3 clusters

Figure 16: Altering the number of features of wine data set with 3 clusters


(a) Silhouette score of wine data set with 3 clusters after undergoing PCA (b) Number of
iterations of wine data set with 3 clusters after undergoing PCA

Figure 17: Wine data set results for 3 clusters after applying PCA


5 Data Analysis 


5.1 Analyzing Number of Clusters Using Silhouette Score 


First, this investigation indicates that the silhouette score is a useful indicator
for choosing the optimal number of clusters, consistently so for the synthetic data and
partly so for the real data. In supervised learning, there is a Ground Truth that the
algorithms are aware of, meaning that accuracy can be measured. However, k-means
clustering is an unsupervised learning algorithm that learns as it runs with no Ground
Truth to compare to, so accuracy is much harder to measure. For the sake of testing, it
was already known that the optimal number of clusters for the synthetic data set was 4.
The results shown in Figure 11 reflect this, with a peak in silhouette score at 4
clusters for both random and k-means++ initial placements, for both 5 and 15 features of
the synthetic data set. This means that the silhouette score consistently identified 4
clusters as the optimal number. Although the algorithm is not aware that this is the
correct answer, it was already known externally, so this serves as evidence supporting
the use of silhouette score as an indicator of accuracy.

For the real data collected with the wine data set, it was externally known that the
optimal number of clusters should be 3. To see whether the silhouette score was able to
capture this optimal cluster value, refer to Figure 14. Although Figure 14a of the wine
data set with 5 features shows a clear peak in silhouette score for 3 clusters, Figure 14b
with 15 features shows that the best silhouette score is achieved with 2 clusters. This is
a solid example of how real data does not behave like synthetic data, where the optimal
number of clusters is the same across all numbers of features used; for real data, the
best-scoring number of clusters depends on which features are included.


5.2 Analyzing Initialization Methods 


As seen across all results, the initial placement (random or k-means++) makes a
negligible difference to the silhouette score, but causes varying results for the number
of iterations. This is especially true for a larger number of clusters. In Figure 12,
the number of iterations stays relatively similar regardless of initial placement on both
graphs of synthetic data sets with 1-3 clusters, but with 4 and 5 clusters there is a
drastic difference between the left bar (random initial placement) and the right
bar (k-means++ initial placement). However, this pattern is not reflected
in the data comparing the number of iterations of a synthetic data set with
4 clusters and varying features, as shown in Figure 13b. Instead, a difference is seen
between initial placements for the middle two feature counts (10 and 15), while 5 and 20
features need the same number of iterations for both initial placements, represented by
the blue and grey bars. For the wine data set results, as seen in Figures 15 and 16, the
initial placements have distinctly different results across all variables tested.

Regardless of what pattern is seen in the differing results between initialization methods, 
all data collected across both the synthetic and wine data sets reflected that silhouette 
score was not different between experiments using different initial placements, but the 


number of iterations always varied at some point. 


Since the initialization method does change the number of iterations, its main impact is
on convergence speed. Across nearly all data, k-means++ outperforms random
initialization by using fewer iterations. This is especially true when the number of
clusters is closer to the Ground Truth known by the external experimenter. Fewer
iterations save computing power and time, a benefit that grows with the size of the data
set; for a small data set such as the wine data set (178 samples), the absolute saving
is modest.
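A comparison of this kind can be sketched as follows. This is an illustration under stated assumptions, not the exact experimental code: `n_init=1` is chosen so that `n_iter_` reflects a single run, and `random_state=10` and the 5-feature setting are arbitrary choices for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 5 features and 4 true centers (settings assumed for the example).
X_syn, _ = make_blobs(n_samples=1000, n_features=5, centers=4, cluster_std=1,
                      center_box=(-10.0, 10.0), shuffle=True, random_state=1)

results = {}
for init in ("random", "k-means++"):
    # A single initialization per method, so the iteration counts are comparable.
    km = KMeans(n_clusters=4, init=init, n_init=1, random_state=10).fit(X_syn)
    results[init] = (km.n_iter_, silhouette_score(X_syn, km.labels_))

for init, (n_iter, score) in results.items():
    print(f"{init:10s} iterations={n_iter:3d} silhouette={score:.4f}")
```

Iteration counts from a single seed are noisy, so repeating the loop over several `random_state` values gives a fairer picture of the difference between the two methods.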


5.3 Analyzing the Number of Features 
5.3.1 Importance of Feature Scaling 


First, it is notable that feature scaling matters in clustering, as different features
are being compared. Otherwise, the result has the potential to be severely skewed. In
this study, the features of the synthetic data were already on a common scale, but those
of the wine data had to be scaled. The wine data set includes features like the
percentage of alcohol and alkalinity of ash, which cannot directly be compared. Once the
features were scaled, the experiment proceeded with the k-means clustering process.


5.3.2 No Direct Relationship Between Features and Clusters 


A very significant result is that the number of features has no direct relationship with
finding the optimal number of clusters. It is quite a common assumption that adding
more features (including more information) will aid clustering processes. However,
this is not necessarily the case. In Figures 11a and 11b, a varying number of clusters
for the synthetic data set with 5 features and 15 features was compared. As seen in
the nearly identical results, varying the number of features between 5 and 15 did not
significantly affect the determination of the optimal number of clusters. There is
slightly more visible variation between Figures 12a and 12b, comparing a synthetic
data set with 5 features and 15 features, but the overall pattern is still similar.
In Figure 13, which depicts the silhouette scores and number of iterations of
the synthetic data set with a varying number of features, no specific pattern is seen.

When looking at the wine data set, the effect of the number of features sometimes differs
from that of the synthetic data set. As seen in Figures 14a and 14b, the wine data set
with 5 features deemed 3 clusters optimal, but the data set with 15 features deemed
2 clusters optimal. As previously mentioned, it was already known that the true optimal
number of clusters was 3 for the wine data set, which means that the configuration in
Figure 14a with 5 features was more effective at determining the true number of clusters
than that in Figure 14b with 15 features. When comparing Figure 15a to 15b and Figure 16a
to 16b, no specific pattern was seen when altering just the number of features with a
fixed number of clusters.


Therefore, simply adding more features does not necessarily improve the accuracy of 


clusters generated by k-means clustering. 


5.3.3 Analysis of Dimensionality Reduction Using PCA 


Dimensionality reduction using PCA was implemented to reduce the feature dimensionality
of the wine data set. The results are shown in Figure 17: the highest silhouette
score belongs to 3 clusters, which also has the lowest number of iterations when compared
with 2 and 4 clusters. This experimental value of the optimal number of clusters matches
the Ground Truth of 3. Thus, dimensionality reduction was helpful for k-means clustering
here, as this result was not consistently shown in the wine data set prior to
dimensionality reduction.
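The PCA step can be sketched as scaling, projecting, and then clustering. The number of retained components is not stated in the text, so `n_components=2` below is an assumption, as are `n_init=10` and `random_state=10`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Scale the 13 wine features, then project onto the leading principal components.
X_scaled = StandardScaler().fit_transform(load_wine().data)
X_reduced = PCA(n_components=2).fit_transform(X_scaled)   # n_components is assumed

scores = {}
for k in (2, 3, 4):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=10).fit(X_reduced)
    # Record the silhouette score and the iterations of the best run for each k.
    scores[k] = (silhouette_score(X_reduced, km.labels_), km.n_iter_)

for k, (score, n_iter) in scores.items():
    print(f"k={k} silhouette={score:.4f} iterations={n_iter}")
```

Comparing the printed scores across k shows which cluster count the silhouette score favors after reduction.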


6 Limitations 


The first limitation of this study is that only a limited range of cluster and feature
counts was considered. Hence, these results can only be considered a local optimum and
cannot be generalized to a global optimum. Future research is needed to confirm or
deny whether the local results hold globally.

This investigation only had a Ground Truth to compare experimental results to because
the data sets were set up externally for experimental purposes. However, since k-means
clustering is unsupervised, the true number of clusters is usually unknown in practice,
which means multiple values must be tested (this experiment tested 5 different numbers
of clusters). Future research could look for a machine learning approach to deciding how
many clusters should be tested in order to find the optimal number of clusters to use.


7 Conclusion 


In this paper, the effects of changing the number of clusters, number of features, and 
initialization methods of k-means clustering were analyzed. Logical and mathematical 


explanations for the patterns observed were also provided. 


The results suggest that silhouette score is a reliable indicator of accuracy, as there
was a Ground Truth to compare experimental results to. However, when k-means clustering
is run in practice, the Ground Truth is unknown, as it is an unsupervised learning
algorithm. Since it is unknown in advance how many clusters should be tested,
researchers currently need to test multiple cluster counts experimentally (as in this
paper) to find the optimum. The range of cluster counts to test can be estimated based
on the application of the algorithm. If the ultimate goal is to cluster students into
different socioeconomic groups in a high school, one might deduce from logical reasoning
that the optimal number of clusters lies between 3 and 5, so a researcher should test
the cluster counts within and around this range (i.e., test 2-6 clusters).
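Testing a range of cluster counts and keeping the best-scoring one can be sketched as below. The data settings, `n_init=10`, and `random_state=10` are assumptions for illustration; the range 2-6 follows the example in the text.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for an unlabeled data set (settings assumed for the example).
X_syn, _ = make_blobs(n_samples=1000, n_features=5, centers=4, cluster_std=1,
                      center_box=(-10.0, 10.0), shuffle=True, random_state=1)

scores = {}
for k in range(2, 7):                      # test 2-6 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(X_syn)
    scores[k] = silhouette_score(X_syn, labels)

# Keep the cluster count with the highest average silhouette score.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The range starts at 2 because the silhouette score is undefined for a single cluster.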
It was found that altering the initialization method had little effect on the silhouette
scores, but using k-means++ generally improved computational running speed, with a
lower number of iterations needed for the algorithm to converge.
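A comparison of initialization methods of the kind summarized above can be sketched as follows. This is an illustrative setup under stated assumptions, not the experiment reported in this paper: the helper name `compare_inits`, the use of ten seeds, the choice of 3 clusters, and the standardized wine features are all choices made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def compare_inits(X, n_clusters, seeds=range(10)):
    """Hypothetical helper: average iterations and silhouette per init method."""
    summary = {}
    for init in ("k-means++", "random"):
        iterations, silhouettes = [], []
        for seed in seeds:
            # n_init=1 so that n_iter_ describes a single run from one initialization
            km = KMeans(
                n_clusters=n_clusters, init=init, n_init=1, random_state=seed
            ).fit(X)
            iterations.append(int(km.n_iter_))
            silhouettes.append(float(silhouette_score(X, km.labels_)))
        summary[init] = {
            "mean_iterations": sum(iterations) / len(iterations),
            "mean_silhouette": sum(silhouettes) / len(silhouettes),
        }
    return summary


X_wine = StandardScaler().fit_transform(load_wine().data)
init_summary = compare_inits(X_wine, n_clusters=3)
for init, stats in init_summary.items():
    print(init, stats)
```

Averaging over several seeds matters here because a single run of either method can converge unusually quickly or slowly by chance.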


The effect of altering the number of features is less predictable, as it followed no clear
relationship for either data set. However, when the features underwent dimensionality
reduction using Principal Component Analysis, both accuracy and speed improved, with a
higher silhouette score and a lower number of iterations.
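The effect of dimensionality reduction described above can be checked with a short side-by-side run. The sketch below assumes the wine data set, 4 principal components (matching the appendix program), 3 clusters, and a single seed. The helper name `fit_and_score` is hypothetical, and a single run like this is only an illustration, not a reproduction of the reported trials.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def fit_and_score(X, n_clusters=3, seed=10):
    """Hypothetical helper: return (mean silhouette, iterations) for one k-means run."""
    km = KMeans(
        n_clusters=n_clusters, init="k-means++", n_init=1, random_state=seed
    ).fit(X)
    return float(silhouette_score(X, km.labels_)), int(km.n_iter_)


# All 13 wine features, standardized
X_scaled = StandardScaler().fit_transform(load_wine().data)
# The same data projected onto its first 4 principal components
X_reduced = PCA(n_components=4).fit_transform(X_scaled)

sil_full, iters_full = fit_and_score(X_scaled)
sil_pca, iters_pca = fit_and_score(X_reduced)
print(f"13 features:      silhouette={sil_full:.3f}, iterations={iters_full}")
print(f"4 PCA components: silhouette={sil_pca:.3f}, iterations={iters_pca}")
```

Each silhouette score is computed in the feature space that was clustered, so the two numbers compare cluster quality as seen in each representation rather than in a common space.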
Hopefully this paper will prove useful to Machine Learning researchers in guiding their
choices as they utilize k-means clustering, leading to more innovative training of
unsupervised learning algorithms across all fields of study.


Appendix


The following program was used for this investigation. Test trials of the k-means
clustering algorithm were run under different configurations, with silhouette score and
number of iterations recorded as results. Some insight needed to write this code was
drawn from Scikit-learn [9].


from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import time


# Generating the sample data from make_blobs
X, y = make_blobs(
    n_samples=1000,
    n_features=20,
    centers=4,
    cluster_std=1,
    center_box=(-10.0, 10.0),
    shuffle=True,
    random_state=1,
)  # For reproducibility


# for wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline

features, target = load_wine(return_X_y=True, as_frame=True)
# newfeatures = features.iloc[:, 0:13]
# scaler = StandardScaler()
# # transform data
# X = scaler.fit_transform(newfeatures)

# for dimension reduction discussion
pca = make_pipeline(StandardScaler(), PCA(n_components=4))
X = pca.fit_transform(features)
print(X)
range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, init="k-means++", random_state=10)
    # random_state means that I set a random seed
    t0 = time.time()
    cluster_labels = clusterer.fit_predict(X)
    t_batch = time.time() - t0


    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print(
        "For n_clusters =",
        n_clusters,
        "The average silhouette_score is :",
        silhouette_avg,
        # "Time used =",
        # t_batch,
        "Kmeans actual iterations =",
        clusterer.n_iter_,
    )


    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(
            np.arange(y_lower, y_upper),
            0,
            ith_cluster_silhouette_values,
            facecolor=color,
            edgecolor=color,
            alpha=0.7,
        )

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples


    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])


    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(
        X[:, 0], X[:, 1], marker=".", s=30, lw=0, alpha=0.7, c=colors, edgecolor="k"
    )

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(
        centers[:, 0],
        centers[:, 1],
        marker="o",
        c="white",
        alpha=1,
        s=200,
        edgecolor="k",
    )

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker="$%d$" % i, alpha=1, s=50, edgecolor="k")

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(
        "Silhouette analysis for KMeans clustering on sample data with n_clusters = %d"
        % n_clusters,
        fontsize=14,
        fontweight="bold",
    )
    fig.savefig("figures/pca4" + str(n_clusters) + ".png")

plt.show()